[main] Add new method to track originating cluster of Backup #613
base: main
Conversation
@jbiers - I'm not moving this to "in review" yet, since this was a very rough first pass. However, if you can take some time to get familiar with the changes and review the list of concerns I thought of (as well as consider/add your own), that'd be a great help to keep the ball rolling on this. In other words, while I could work on this more, I'd love your feedback first before I spend more time on it.
@alexandreLamarre - I've added you as a reviewer here as I'd like to get your take on this. When you have time, LMK if you'd like to sync up and chat about it in a huddle. Or feel free to review in your own time and give me feedback here.
Mostly some small comments about making the code simpler, more readable, and more testable.
Since this PR could be a stepping stone to fully allowing backups of backup CRDs themselves, storing the cluster origin metadata in the Status feels wrong.
But if we do intend to fully support taking backups of backup CRDs, then even storing the cluster origin metadata in the Spec section of the CRD feels wrong.
What happens when we have a list of remote backups, say A, B, C, all taken from different origins? If the current cluster matches origin C, then our status will indicate the restore is viable even if they specify an older backup A or B. I know this hypothetical would be rare, but I think it points to the fact that maybe we should think about this differently.
The current design labels recurring backups as coming from this cluster only, whereas individual backups are tied to the specific clusters they came from.
What if we embedded cluster origin metadata not in the CRDs, but the backup data itself?
This way the flow looks like
- Each remote backup contains a cluster origin
- Each restore pointing to a specific backup can determine before actually restoring objects if it is viable
- We can add a flag to the restore spec to allow unsafe backups, ones where the cluster origin doesn't match (see the sketch after this comment); by default, we disallow cases where we could have unsafe restores
- If the restore is unsafe, the status is surfaced through the restore CR
- Users can then modify the restore CR to restore the backup even if it is unsafe after the fact, or opt in before creating the restore CR
From the end user's perspective it is somewhat clunkier to take a restore and then be informed that it is potentially unsafe, but as an end user I value being shown correct information more.
Additionally, storing the cluster origin metadata in the backup data allows us to add e2e tests for each corner case when taking a restore, producing a potentially less volatile new feature.
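A minimal sketch of what that check could look like, assuming the origin metadata is embedded in the backup archive itself; `AllowUnsafeRestore` and `readBackupOriginMetadata` are hypothetical names for illustration, not existing API:

```go
// Hypothetical sketch: validate the cluster origin embedded in the backup data
// before restoring. AllowUnsafeRestore and readBackupOriginMetadata are
// placeholder names for illustration only.
func validateRestoreOrigin(restore *v1.Restore, currentClusterUID string) error {
    // The origin UID would live in the backup archive itself, not in any CRD.
    originUID, err := readBackupOriginMetadata(restore.Spec.BackupFilename)
    if err != nil {
        return fmt.Errorf("reading backup origin metadata: %w", err)
    }
    if originUID != currentClusterUID && !restore.Spec.AllowUnsafeRestore {
        // Surface the mismatch via the restore CR status instead of restoring.
        return fmt.Errorf("backup was taken on cluster %q but this cluster is %q; set allowUnsafeRestore to proceed anyway",
            originUID, currentClusterUID)
    }
    return nil
}
```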
backupName                     string
hasClusterOriginID             bool
clusterOriginID                string
hasCurrentOriginCondition      bool
currentOriginCondition         bool
canInPlaceRestore              bool
hasInPlaceRestoreCondition     bool
currentInPlaceRestoreCondition bool
Instead of having boolean fields injected during construction, could encapsulate them in a dedicated type:
type backupClusterMdChecker struct {
    backupName    string
    clusterOrigin string
    b             *backupv1.Backup
    // Any controllers or client runtime needed for methods, if any
}
func (b backupClusterMdChecker) hasClusterOriginID() bool {
    // ...
}
func (b backupClusterMdChecker) hasCurrentOriginCondition() bool {
    // ...
}
This allows us to isolate some of the more complicated logic, as well as allowing for testing if we make these methods or the struct public, which we probably should, because we should be focused on stability.
If making them public, could consider moving them to a sub-package such as pkg/controllers/backup/origin to prevent imports of heavy k8s dependencies in unit tests.
Also greatly simplifies the if statements below IMO.
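For example, a call site in the handler could then read roughly like this (a sketch building on the struct above; the field wiring is assumed, not taken from the PR):

```go
// Sketch: construct the checker once and let its methods carry the logic.
checker := backupClusterMdChecker{
    backupName:    backup.Name,
    clusterOrigin: h.kubeSystemNS, // kube-system namespace UID used as cluster identity
    b:             backup,
}

if checker.hasClusterOriginID() && !checker.hasCurrentOriginCondition() {
    // Only flag a status update when something actually changed.
    shouldUpdateStatus = true
}
```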
dynamicClient:             dynamicInterface,
defaultBackupMountPath:    defaultLocalBackupLocation,
defaultS3BackupLocation:   defaultS3,
canUseClusterOriginStatus: util.VerifyBackupCrdHasClusterStatus(clientSet.ApiextensionsV1()),
I would avoid passing the handler a value returned by a function that makes API calls. Instead, I would include that logic in the setup itself, before the call to Register.
See comment below as well.
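Roughly along these lines, as a sketch only (the surrounding setup code is assumed, not quoted from the PR):

```go
// Sketch: resolve the CRD capability once during setup, before registering the
// handler, instead of calling the API server while constructing the handler literal.
canUseClusterOriginStatus := util.VerifyBackupCrdHasClusterStatus(clientSet.ApiextensionsV1())

h := &handler{
    dynamicClient:             dynamicInterface,
    defaultBackupMountPath:    defaultLocalBackupLocation,
    defaultS3BackupLocation:   defaultS3,
    canUseClusterOriginStatus: canUseClusterOriginStatus,
}
// ...then register h as before.
```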
clusterOriginChanged := h.prepareClusterOriginConditions(backup)
if clusterOriginChanged {
    shouldUpdateStatus = true
}
nit: can simplify using
if originChanged := h.prepareClusterOriginConditions(backup); originChanged {
    shouldUpdateStatus = true
}
if err != nil {
    logrus.Infof("Error fetching CRD: %v", err)
    return false
}
In a more ready version of this, I'd expect any errors like this to be surfaced in the setup step, and potentially to exit, since an error like that signifies either a more serious problem or a transient error (in which case we could consider a backoff mechanism).
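A sketch of what that could look like in the setup step, assuming `getCRDDefinition` is reachable there and using the apimachinery wait helpers:

```go
// Sketch: retry the CRD lookup with backoff at setup and fail fast if it never
// succeeds, rather than silently returning false inside the handler.
backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 2.0, Steps: 5}
if err := wait.ExponentialBackoff(backoff, func() (bool, error) {
    if _, err := getCRDDefinition(client, crdName); err != nil {
        logrus.Warnf("fetching CRD %s failed, retrying: %v", crdName, err)
        return false, nil // treat as transient and retry
    }
    return true, nil
}); err != nil {
    return fmt.Errorf("backup CRD %s not available: %w", crdName, err)
}
```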
@@ -37,6 +39,7 @@ type BackupSpec struct {

type BackupStatus struct {
    Conditions    []genericcondition.GenericCondition `json:"conditions"`
    OriginCluster string                              `json:"originCluster,omitempty"`
I'm not convinced the origin cluster metadata should be stored in the Status field.
I can see from the original CRD definition that the semantic difference between spec (the desired state of the object) and the status (what is the current state of the object) is a little bit muddied in this operator.
IMO, the origin cluster (and additional metadata) could be set in the Spec and be fully managed by the operator
In any case, the backup CRD's status should be the one surfacing whether or not a backup is viable for sure 👍
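Something like this, as a sketch only (field name and shape up for discussion):

```go
// Sketch: nullable, operator-managed origin metadata on the Spec instead of the Status.
type BackupSpec struct {
    // ...existing fields...

    // OriginCluster is set by the operator when the backup is created and
    // identifies the cluster the backup was taken from (e.g. the kube-system
    // namespace UID). A pointer keeps it nullable so CRs from older versions stay valid.
    OriginCluster *string `json:"originCluster,omitempty"`
}
```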
crd, err := getCRDDefinition(client, crdName)
if err != nil {
    logrus.Infof("Error fetching CRD: %v", err)
    return false
}

// Inspect the status schema, for example
_, found := crd.Spec.Versions[0].Schema.OpenAPIV3Schema.Properties["status"].Properties["originCluster"]
if found {
    logrus.Debugf("Status schema contains `originCluster` on CRD `%s`.\n", crdName)
    return true
}
Going back to the above discussion, storing the cluster metadata in the backup CRD spec as a nullable object shouldn't be a breaking change, and we can instead check whether it is nil rather than doing dynamic CRD reflection.
IMO it is easier to make this testable if we avoid checking CRD definitions and focus on the presence/absence of a new field to determine that.
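I.e. something like this, building on the hypothetical `Spec.OriginCluster` field sketched above:

```go
// Sketch: rely on the presence of the nullable field instead of reflecting on
// the CRD's OpenAPI schema at runtime.
func hasKnownOrigin(backup *v1.Backup) bool {
    return backup.Spec.OriginCluster != nil && *backup.Spec.OriginCluster != ""
}
```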
}

// prepareClusterOriginConditions helps set the cluster origin conditions and reports if anything changed in this part of status.
func (h *handler) prepareClusterOriginConditions(backup *v1.Backup) bool {
I think a lot of the logic in this function could also be simplified and made more readable if we encapsulate most of it in struct methods.
kubeSystemNS string
// TODO: rename to kubeSystemNamespaceUID; nit to improve clarity, it's not the string representation nor the NS resource
kubeSystemNS              string
canUseClusterOriginStatus bool
Stylistically, we should avoid storing additional boolean fields in favor of a method.
I also think this field might be useless? If we opt into using the backup CRD spec for storing cluster metadata, and older backup-restore versions don't set this new nullable field, then we cannot determine where it is from.
currentOriginConditionString := condition.Cond(v1.BackupConditionClusterOrigin).GetStatus(backup)
if currentOriginConditionString != "False" {
    condition.Cond(v1.BackupConditionClusterOrigin).SetStatusBool(backup, false)
    condition.Cond(v1.BackupConditionClusterOrigin).Message(backup, "CRD not updated to include cluster UID yet.")
When would this realistically happen?
Yeah, at least I think so. Granted, it was during dev workflows, but I did run into it myself.
Maybe not as big a concern in prod, but I think it could still happen with a chart like BRO if the CRDs aren't also updated, e.g. when installed directly via the Helm CLI.
Or even in the edge case where a user manually targets a newer image than the chart. However, if we opt not to change the CRD to add it to status, this wouldn't be needed anyway.
This partially solves #612.
However, it's important to note that while I wrote an RFE for this too, the root here is seeking to fix a bug. The bug is that "BRO itself doesn't do anything to give indications when In-Place Restore is acceptable to use".
The short version of the fix is that we add a few new status details and expose them via `kubectl`, and later make a change to `rancher/dashboard` to ensure the UI can also benefit from this new context by adding a column.

To be discussed:
- `2.10.x` release?
- `k8s` seems to be OK with full-on schema additions (of `spec` too) as long as they're optional/nullable. This is hopefully less impactful since users cannot control `status`.
- `originCluster` not existing on status,